Abstract: Identifying anomalies and contamination in datasets
is important in a wide variety of settings. In this paper, we 
describe a new technique for estimating contamination in large, 
discrete valued datasets. Our approach considers the normal 
condition of the data to be specified by a model consisting of 
a set of distributions. Our key contribution is in our approach 
to contamination estimation. Specifically, we develop a technique 
that identifies the minimum number of data points that must be 
discarded (i.e., the level of contamination) from an empirical
data set in order to match the model to within a specified 
goodness-of-fit, controlled by a p-value. Appealing to results from 
large deviations theory, we show a lower bound on the level of 
contamination is obtained by solving a series of convex programs. 
Theoretical results guarantee the bound converges at a rate of
$O(\sqrt{\log(p)/p})$, where $p$ is the size of the empirical data set.

Index terms: contamination estimation, anomaly detection, entropy 
minimization, discrete goodness-of-fit testing. 

I. Introduction 

Anomalies in datasets are typically associated with unexpected
or unwanted characteristics such as contamination, noise or
outliers that deviate significantly from expectations. The
ability to detect anomalies and accurately estimate contamination
in datasets is important in a wide variety of domains including
healthcare, astronomy, environmental and materials sciences. The
context that motivates our work is detecting anomalies and
estimating contamination in datasets collected from communication
and computer systems. Specific applications of anomaly detection
in these datasets include network management and Internet
security broadly defined. Communication and Internet measurement
datasets have several distinguishing characteristics including
the potential for extreme scale and high dimensionality.

The standard framework for anomaly detection is based on
establishing a baseline for normal (e.g., in a distributional
sense) and then setting a threshold which, if exceeded, identifies
an anomaly. The goal in establishing norms and thresholds
is to identify anomalies with low false alarm rates. There is
an extensive literature on methods for anomaly detection (see
related work in Section III).

In this paper we describe a new method for anomaly detection
which is based on estimating the level of contamination in
a dataset. An anomaly is declared if a dataset has an elevated
level of contamination. We consider the contamination-free (i.e.,
normal) condition of a dataset to be specified by a model
comprising a set of distributions. We then compare the model
to the distributional profile of a target dataset collected over
a specified period. A standard method for comparing datasets
in this way is goodness of fit (GoF) testing [1]. To the best of
our knowledge, this paper is the first to address the problem of
contamination estimation using GoF testing based on entropy
minimization, as we define in Section II-B.

The approach we develop is based on answering the following
question. Given a model consisting of a family of
distributions, a specified p-value, and an empirical dataset,
what is the minimum number of data points that must be
discarded so that the empirical distribution of the data matches
a member model distribution (in terms of GoF for a specified
p-value)? This is akin to finding the largest subset of the
original dataset which has an empirical distribution close to
the model. We show that this question can be efficiently
answered by solving a series of convex optimizations. Solving
the optimizations results in a lower bound on the minimum
number of data points that are attributed to contamination. In
the simplest case, each convex optimization is an inequality
constrained entropy minimization problem (whose dual is a
constrained geometric program) which can be solved in real
time and at scale for many applications. More generally, the
approach can be applied to any setting in which the model
consists of a convex set of distributions. Two specific instances
which we discuss are 1) models defined by any number of
distributions with arbitrary mixture proportions, and 2) models
defined by the set of distributions with small Kullback-Leibler
(KL) divergence to a specified distribution, which arises when
the model itself is generated from a finite amount of data.
Lastly, we show the lower bound output by the optimization
converges to an upper bound known as the separation distance
at a rate of $O(\sqrt{\log(p)/p})$, where $p$ is the number of data
points.

II. Quantifying Contamination 

A. Notation 

Let $P \in \mathbb{R}^n$ and $Q \in \mathbb{R}^n$ denote probability mass functions
over $n$ categories, with elements $P_i$, $i = 1,\dots,n$ and $Q_i$,
$i = 1,\dots,n$. Throughout, $P$ denotes the distribution under
test, $Q$ denotes a member distribution of the model, $Q^0$
denotes the 'true' unknown model distribution, and $Q^j$ indexes
multiple distributions. The empirical distribution of a sequence
of random variables $X = X_1,\dots,X_p \in \mathcal{X}^p$ is the relative
proportion of occurrences of each element of $\mathcal{X}$ in $X$. Specifically,
let $\mathcal{X} := \{x_1, x_2, \dots, x_n\}$ and define $p_i = \sum_{j=1}^{p} 1_{\{X_j = x_i\}}$
for $i = 1,\dots,n$. Then $P(X) = \frac{1}{p}[p_1, p_2, \dots, p_n]$.

$\mathbb{P}_Q(\cdot)$ denotes probability measure with respect to distribution
$Q$. For simplicity of notation, we write $\mathbb{P}_Q(\{P^1, P^2\})$ as
shorthand for $\mathbb{P}_Q(\{X \in \mathcal{X}^p : P(X) \in \{P^1, P^2\}\})$. The
Kullback-Leibler divergence between two distributions is defined
in the usual manner,

$$D(P\|Q) := \sum_{i} P_i \log\left(\frac{P_i}{Q_i}\right).$$

$D(P\|Q)$ is a jointly convex function in $P$ and $Q$. The
minimum entropy set, $\{P : D(P\|Q) \le \epsilon\}$, is a convex set (for
a fixed $Q$, $\epsilon$). Lastly, let $S_n$ denote the probability simplex:

$$S_n := \left\{P \in \mathbb{R}^n : \sum_{i} P_i = 1,\ P_i \ge 0,\ i = 1,\dots,n\right\}.$$
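As a concrete illustration of the notation, the following minimal Python sketch computes the empirical distribution $P(X)$ and the KL divergence $D(P\|Q)$ for a small sample. The function names and the toy data are illustrative assumptions and do not come from the paper; the convention $0 \log 0 = 0$ is assumed.

```python
import math
from collections import Counter

def empirical_distribution(samples, categories):
    """Return P(X): the relative frequency of each category in the sample."""
    counts = Counter(samples)
    p = len(samples)
    return [counts[x] / p for x in categories]

def kl_divergence(P, Q):
    """D(P||Q) in nats, with the convention 0 * log(0) = 0."""
    total = 0.0
    for p_i, q_i in zip(P, Q):
        if p_i > 0:
            if q_i == 0:
                return math.inf  # P puts mass where Q has none
            total += p_i * math.log(p_i / q_i)
    return total

# Hypothetical toy data: p = 8 samples over n = 3 categories.
categories = ["a", "b", "c"]
X = ["a", "a", "b", "a", "c", "a", "b", "a"]
P = empirical_distribution(X, categories)  # counts 5, 2, 1 divided by 8
Q = [1 / 3, 1 / 3, 1 / 3]                  # uniform model distribution
D = kl_divergence(P, Q)
```

Natural logarithms are used throughout, which matches the exponential form of the large deviations bounds in Appendix A.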

B. Quantifying Contamination 

Consider a set of model distributions $\mathcal{Q}$ whose elements
are supported over a finite number of categories $\mathcal{X}$ with
$|\mathcal{X}| = n$. For example, $\mathcal{Q}$ could be a set of minimum entropy
distributions, or the set of mixture distributions $Q = \sum_j \pi_j Q^j$,
where the mixture weights $\pi_j$ are unknown ($\mathcal{Q}$ is the set of all such mixture
distributions). Let $X \in \mathcal{X}^p$ denote a collection of samples.
An unknown subset of the samples consists of i.i.d. draws
from an unknown distribution $Q \in \mathcal{Q}$. The remaining samples,
indexed by $C \subset [p]$, are generated by some other means, and correspond
to contaminated samples. This paper is concerned with lower
bounding the size of the contaminating set $C$ given the set
of model distributions $\mathcal{Q}$, a specified significance level (a
p-value), and the observed samples $X_1,\dots,X_p$.

Intuitively, if the empirical distribution of a sequence of 
random variables is close to the model distribution in terms 
of GoF, we conclude the sequence is not contaminated. To 
quantify this intuition, we define a set of typical empirical 
distributions based on statistical significance; we note this 
definition is distinct from the usual definitions of strongly and 
weakly typical, and making this connection is a contribution 
herein. 

Definition 1. Typical. Let $P^1, P^2, \dots$ be an ordering on all
empirical distributions (of $p$ samples and $n$ categories) such
that $\mathbb{P}_Q(P^1) \le \mathbb{P}_Q(P^2) \le \dots$. A sequence of random
variables $X$ with $P(X) = P^\ell$ is typical at significance level
$\epsilon$ with respect to $\mathcal{Q}$ iff

$$\sup_{Q \in \mathcal{Q}} \mathbb{P}_Q\left(\{P^1, P^2, \dots, P^{\ell-1}, P^\ell\}\right) > \epsilon \qquad (1)$$

for any such ordering.¹

The definition implies a sequence of random variables $X$
is typical if the probability of the empirical distribution of
$X$ or any less likely empirical distribution is more than a
specified significance level. Note $\epsilon$ is interpreted as a p-value;
as $\epsilon$ approaches zero, all sequences become typical (requiring
stronger evidence to reject the null hypothesis). As $\epsilon$ increases,
fewer sequences are typical.
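For two categories and a single model distribution, Definition 1 can be evaluated exactly, because every empirical distribution is indexed by the count $k$ of the first category and its probability is binomial. The sketch below is an illustration under that assumption only; the function names are hypothetical, and ties in probability are resolved by including all equally likely outcomes, which is one possible reading of "any such ordering".

```python
import math

def exact_tail_probability(k_obs, p, q):
    """Total Binomial(p, q) probability of all counts k that are no more
    likely than the observed count k_obs (ties included)."""
    pmf = [math.comb(p, k) * q**k * (1 - q)**(p - k) for k in range(p + 1)]
    return sum(v for v in pmf if v <= pmf[k_obs])

def is_typical_binary(k_obs, p, q, eps):
    """Definition 1 for n = 2 and a single model distribution Q = [q, 1 - q]."""
    return exact_tail_probability(k_obs, p, q) > eps

# Hypothetical example: p = 100 samples, model Q = [0.5, 0.5], eps = 0.05.
balanced = is_typical_binary(52, 100, 0.5, 0.05)  # close to the mode
skewed = is_typical_binary(80, 100, 0.5, 0.05)    # far in the tail
```

For more than two categories the number of empirical distributions grows quickly with $p$ and $n$, which is what motivates the bounds developed in Sec. II-D.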

¹Note the ordering is an implicit function of $Q$; we suppress this for
simplicity of notation.


Definition 2. Contaminated. We say $X$ is contaminated iff
$X$ is not typical (with respect to $\mathcal{Q}$ and with significance $\epsilon$).
Likewise, an empirical distribution $P(X)$ is contaminated iff
$X$ is not typical.

In this paper we study the following question. Let $X = X_1,\dots,X_p$
be a dataset, and let $X_C = \{X_i : i \in C\}$ be
any subset of the original dataset. What is the smallest set
$C \subset [p]$ such that $X_{[p]\setminus C}$ is not contaminated? Specifically, let

$$c^* = \inf\left\{|C| : X_{[p]\setminus C} \text{ is typical for } (\mathcal{Q}, \epsilon)\right\}.$$

How and under what conditions can one compute $c^*$ efficiently?
Our main focus and insight will be on the continuous
approximation to $c^*/p$, denoted $\alpha^*$:

$$\alpha^* = \inf\left\{\alpha \in [0,1] : \exists\, P \in \mathcal{P}(X,\alpha) \text{ typical for } (\mathcal{Q}, \epsilon)\right\}$$

where $\mathcal{P}(X,\alpha)$ is the set of all distributions that can be created
by discarding a fraction $\alpha$ of the mass of $P(X)$ (see Sec. II-D):

$$\mathcal{P}(X,\alpha) = \left\{P \in S_n : P_i \le \frac{P_i(X)}{1-\alpha},\ i = 1,\dots,n\right\}. \qquad (2)$$


Throughout, $\alpha$ is a key parameter that represents the fraction
of the dataset attributed to contamination; $\alpha^*$ represents the
smallest $\alpha$ such that there exists a subset of the original data of
size $p(1-\alpha)$ that is not contaminated. If $\alpha^* = 0$, the original
dataset is not contaminated; if $\alpha^* = 1$, the entire dataset must
be attributed to contamination.


C. Separation Distance 

We assume $X_i \sim Q^0$ for all $i \notin C$. For $X_i$, $i \in C$,
no assumption is made. This agnostic approach has inherent
limitations. In the extreme case the distribution of the
contaminated data could exactly follow that of the model. Here, the
distribution of the full dataset should closely match the model,
and be indistinguishable from the setting where $C$ is empty.
No contamination should be reported to within the significance
level (in $m$ realizations of $X^p$, we expect $c^* \neq 0$ fewer than
$m\epsilon$ times).

A more interesting scenario is when the empirical distribution
of the full dataset converges to a distinct distribution, i.e.,
$P(X^p) \to P \neq Q^0$. In the case that $\mathcal{Q} = \{Q^0\}$, a consistent
estimator will report non-zero contamination for large $p$. $P$ can
be written as a mixture distribution, and we are interested in
reporting the smallest $\kappa$ such that $(1-\kappa)Q^0 + \kappa F = P$ for any
distribution $F$. $F$ represents the contaminating distribution,
and $\kappa$ the proportion of the samples which are drawn from
$F$. This minimum value of $\kappa$ is known as the separation
distance [2] between $P$ and $Q^0$, written succinctly as

$$\kappa(P\|Q^0) = \max_{i \in [n]} \left(1 - \frac{P_i}{Q^0_i}\right).$$

In this way, the separation distance between the empirical
distribution of the data and model distribution plays an important
role in the behavior of $c^*$ and $\alpha^*$ as the sample size grows.
We show as a corollary to later results that $\alpha_L$ is both upper
bounded by and converges to $\kappa(P(X)\|Q^0)$ as $p$ grows (see
Proposition 1 and Theorem 6).
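The separation distance is simple to evaluate. The following sketch, with hypothetical function names and a hypothetical four-category example, builds $P = (1-\kappa)Q^0 + \kappa F$ for a point-mass contaminant $F$ and recovers $\kappa$ from the formula above. It assumes $Q^0$ has full support.

```python
def separation_distance(P, Q):
    """kappa(P||Q) = max_i (1 - P_i / Q_i), assuming Q_i > 0 for all i."""
    return max(1.0 - p_i / q_i for p_i, q_i in zip(P, Q))

def mix(Q, F, kappa):
    """Return the mixture (1 - kappa) * Q + kappa * F."""
    return [(1 - kappa) * q + kappa * f for q, f in zip(Q, F)]

Q0 = [0.25, 0.25, 0.25, 0.25]  # model distribution
F = [1.0, 0.0, 0.0, 0.0]       # contaminating point mass
P = mix(Q0, F, 0.2)            # 20% of the mass comes from F
kappa = separation_distance(P, Q0)
```

The maximum is attained on the categories that the contaminant does not touch, where $P_i/Q^0_i = 1 - \kappa$.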




D. Convex Relaxations 

With the exception of problems involving data over only two
categories ($n = 2$), directly checking if a sample is contaminated
is computationally prohibitive, even in the setting where
the model consists of a single distribution (when $\mathcal{Q} = \{Q^0\}$).
Alternatively, using large deviations results, bounds can be
derived. The bound presented below can confirm if a particular
dataset is contaminated. The theorem involves the KL
divergence between the empirical distribution and a member
of $\mathcal{Q}$. In the case where $\mathcal{Q} = \{Q^0\}$, the bound provides a
simple way to check if a sample is contaminated at a particular
significance level $\epsilon$; in the more general case, if $\mathcal{Q}$ is a convex
set, numerical optimization techniques can efficiently check
the condition.


Theorem 1. (Outer Bound). If

$$\inf_{Q \in \mathcal{Q}} D(P(X)\|Q) > \frac{1}{p}\log\left(\frac{1}{\epsilon}\right) + \frac{2n}{p}\log(p+1) \qquad (3)$$

then $X$ is contaminated at significance level $\epsilon$.

Proof: See Appendix A. ■
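For a single model distribution the condition in (3) reduces to one KL evaluation and one threshold. The sketch below illustrates this; the function names and the example distributions are assumptions made for illustration, and $Q$ is assumed to have full support.

```python
import math

def kl_divergence(P, Q):
    """D(P||Q) in nats, with 0 * log(0) = 0 and Q_i > 0 assumed."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def outer_bound_threshold(p, n, eps):
    """Right-hand side of (3): (1/p) log(1/eps) + (2n/p) log(p + 1)."""
    return (math.log(1.0 / eps) + 2 * n * math.log(p + 1)) / p

def is_contaminated(P, Q, p, eps):
    """Sufficient (not necessary) check from Theorem 1 for a single Q."""
    return kl_divergence(P, Q) > outer_bound_threshold(p, len(Q), eps)

Q0 = [0.25, 0.25, 0.25, 0.25]
P = [0.4, 0.2, 0.2, 0.2]  # hypothetical empirical distribution
small_sample = is_contaminated(P, Q0, p=100, eps=0.05)
large_sample = is_contaminated(P, Q0, p=10000, eps=0.05)
```

The same empirical distribution is flagged only at the larger sample size, since the threshold shrinks roughly as $\log(p)/p$.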

Theorem 1 is an outer bound; any empirical distribution
with KL divergence greater than the stated quantity (from all
elements in $\mathcal{Q}$) is contaminated. Theorem 1 can be used to
bound the size of the smallest set $C \subset [p]$ such that $X_{[p]\setminus C}$ is
not contaminated. This is simplified if $\mathcal{Q}$ consists of a single
model distribution; we first discuss this scenario. In principle,
given a dataset $X \in \mathcal{X}^p$ and a model distribution $Q^0$, one
could first check if $X$ is contaminated by evaluating (3). If (3)
holds, $X$ is contaminated, and an immediate question follows:
how many and which data points must be excluded so that
(3) no longer holds? An exhaustive approach to answer this
question would be the following. For each $x_i \in \mathcal{X}$, discard a
single data point that takes the value $x_i$, and recalculate the
empirical distribution with the data point removed. Of the $n$
new empirical distributions, check if the one with minimum
KL divergence to the model distribution still satisfies (3).

If (3) still holds for all possible empirical distributions
with one data point removed, check all distinct empirical
distributions that can be created by discarding 2 data points
(roughly $n^2$ possibilities, provided each category appears at least
twice in the data). Continuing in this manner, one would check
each of the $\sim n^m$ possible empirical distributions that can be
created by discarding $m$ data points. When (3) is first violated,
$m$ lower bounds the minimum number of data points that must
be excluded to match the model. We can interpret this as a
series of integer programs. For $m = 0,\dots,p$ define $D^*_m$ as
the solution to

$$\begin{aligned}
\underset{m_1, m_2, \dots, m_n \in \mathbb{N}}{\text{minimize}} \quad & \sum_{i=1}^{n} \frac{p_i - m_i}{p - m} \log\left(\frac{p_i - m_i}{(p - m)\, Q^0_i}\right) \\
\text{subject to} \quad & \sum_{i=1}^{n} m_i = m \qquad\qquad (4) \\
& m_i \le p_i, \quad i = 1,\dots,n
\end{aligned}$$


where $p_i$ is the number of times $x_i$ appears in the original
dataset $X$. The optimization variables, $m_i$, represent the
number of samples to discard corresponding to a particular
$x_i$. Note that the objective is the KL divergence between the
new empirical distribution (with $m$ samples removed) and the
known distribution $Q^0$. The value of $D^*_m$ can be checked against
Theorem 1, providing conditions under which one can find a
set $|C| = m$ such that $X_{[p]\setminus C}$ is not contaminated. This gives
a bound on $c^*$. Specifically,

$$c^* \ge \max\left\{m : D^*_m > \frac{1}{p-m}\log\left(\frac{1}{\epsilon}\right) + \frac{2n}{p-m}\log(p-m+1)\right\}.$$

Note that the condition in Theorem 1 will always be met for
some $m$; in particular, for $m = p$, by convention $D^*_p = 0$,
implying that the empty set, $X_{\{\}}$, is not contaminated.
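For very small problems the series of integer programs in (4) can be solved by brute force, which makes the exhaustive procedure concrete. The sketch below is only an illustration of that procedure: the function names and the three-category counts are hypothetical, and the enumeration grows roughly like $n^m$, so it is not practical beyond toy sizes.

```python
import math

def kl_counts(counts, Q):
    """KL divergence between the empirical distribution of `counts` and Q."""
    total = sum(counts)
    return sum((c / total) * math.log((c / total) / q)
               for c, q in zip(counts, Q) if c > 0)

def removals(counts, m):
    """Yield every vector (m_1, ..., m_n) with sum m and 0 <= m_i <= counts[i]."""
    if len(counts) == 1:
        if m <= counts[0]:
            yield (m,)
        return
    for first in range(min(m, counts[0]) + 1):
        for rest in removals(counts[1:], m - first):
            yield (first,) + rest

def exhaustive_lower_bound(counts, Q, eps):
    """Largest m whose optimal value D*_m in (4) still exceeds the threshold;
    returns 0 when no m qualifies (the trivial bound)."""
    p, n = sum(counts), len(counts)
    best = 0
    for m in range(p):  # m = p leaves the empty set, where D is 0 by convention
        d_star = min(kl_counts([c - r for c, r in zip(counts, rem)], Q)
                     for rem in removals(counts, m))
        threshold = (math.log(1 / eps) + 2 * n * math.log(p - m + 1)) / (p - m)
        if d_star > threshold:
            best = m
    return best

# Hypothetical data: p = 60 samples, heavily skewed toward the first category.
counts = [56, 2, 2]
Q0 = [1 / 3, 1 / 3, 1 / 3]
bound = exhaustive_lower_bound(counts, Q0, eps=0.05)
```

The continuous relaxation that follows replaces this enumeration with a single convex program per candidate fraction of removed samples.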

The optimization in (4) is an integer program over a subset
of $\mathbb{N}^n$. To efficiently solve the optimization, we can translate
the integer valued variables to their continuous counterparts;
specifically, let $P_i(X) = p_i/p$ be the original empirical distribution,
and $\alpha = m/p$ represent the fraction of the total samples
discarded. Making these substitutions results in a convex
entropy minimization problem:

$$\begin{aligned}
\underset{P \in S_n}{\text{minimize}} \quad & \sum_{i=1}^{n} P_i \log\left(\frac{P_i}{Q^0_i}\right) \\
\text{subject to} \quad & P_i \le \frac{P_i(X)}{1-\alpha}, \quad i = 1,\dots,n \qquad (5)
\end{aligned}$$

where $\alpha \in [0,1]$ represents the fraction of samples removed.
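When $Q^0$ has full support, the KKT conditions of (5) imply that the minimizer has the form $P_i = \min(c\,Q^0_i,\ P_i(X)/(1-\alpha))$ for a scalar $c \ge 1$ chosen so that the entries sum to one. The sketch below uses that observation with a bisection on $c$. It is a stand-in written for illustration, not the CVXOPT-based implementation used in the experiments, and the function name `min_kl_box` is hypothetical.

```python
import math

def min_kl_box(P_hat, Q, alpha, iters=200):
    """Solve (5): minimize D(P||Q) s.t. sum(P) = 1, P_i <= P_hat_i / (1 - alpha).

    Assumes Q_i > 0 for all i and 0 <= alpha < 1. Returns (P, D)."""
    u = [p / (1.0 - alpha) for p in P_hat]  # upper bounds; they sum to >= 1

    def mass(c):
        return sum(min(c * q, ui) for q, ui in zip(Q, u))

    # mass(1) <= 1 <= mass(hi), and mass is non-decreasing in c.
    lo, hi = 1.0, max(ui / q for ui, q in zip(u, Q))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    P = [min(hi * q, ui) for q, ui in zip(Q, u)]
    D = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)
    return P, D

Q0 = [0.25, 0.25, 0.25, 0.25]
P_hat = [0.4, 0.2, 0.2, 0.2]  # hypothetical empirical distribution, kappa = 0.2
_, D_0 = min_kl_box(P_hat, Q0, alpha=0.0)
_, D_mid = min_kl_box(P_hat, Q0, alpha=0.1)
_, D_kappa = min_kl_box(P_hat, Q0, alpha=0.2)
```

In this example the optimal value decreases as $\alpha$ grows and vanishes once $\alpha$ reaches the separation distance, in line with the monotonicity used in Appendix B.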

More generally, $\mathcal{Q}$ is a set of distributions. The same
continuous approximation results in a joint optimization over
the model space $\mathcal{Q}$ and the space of empirical distributions,
$\mathcal{P}(X,\alpha)$ defined in (2). Formally, let $D^*_\alpha$ be given as

$$D^*_\alpha = \min_{P \in \mathcal{P}(X,\alpha),\ Q \in \mathcal{Q}} D(P\|Q). \qquad (6)$$


If $\mathcal{Q}$ is a convex set, the above optimization can be efficiently
solved in many settings (see Sec. II-E).

To answer our original question and bound $\alpha^*$, one can
conduct a line search over $\alpha \in [0,1]$, repeatedly solving
the above optimization, and checking the output value of
$D^*_\alpha$ against Theorem 1. This is captured in the following
proposition.

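Proposition 1. Let

$$\alpha_L = \max\left\{\alpha : D^*_\alpha > \frac{1}{p(1-\alpha)}\log\left(\frac{1}{\epsilon}\right) + \frac{2n}{p(1-\alpha)}\log\left(p(1-\alpha)+1\right)\right\} \qquad (7)$$

then $\alpha_L \le \alpha^*$.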


Proof: The proof follows directly from Theorem 1. For
any $\alpha$ such that the condition on $D^*_\alpha$ in (7) holds, by Theorem
1 any distribution in $\mathcal{P}(X,\alpha)$ is contaminated. We note that
$\alpha_L$ always exists by monotone properties of $D^*_\alpha$ and the right
hand side of the conditional in (7). See Appendix B, Theorem
6 for details. ■
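The line search in Proposition 1 can be sketched as a bisection over $\alpha$, relying on the monotonicity noted in the proof. The code below is a minimal illustration for the single-distribution case with a full-support $Q^0$; the helper `min_kl_value` is a hypothetical water-filling solver for (5) rather than the CVXOPT routine used in the experiments, and the test distribution mimics the point-mass mixture described for Fig. 2.

```python
import math

def min_kl_value(P_hat, Q, alpha, iters=100):
    """Optimal value of (5) for a full-support Q (water-filling via bisection)."""
    u = [p / (1.0 - alpha) for p in P_hat]
    lo, hi = 1.0, max(ui / q for ui, q in zip(u, Q))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(min(mid * q, ui) for q, ui in zip(Q, u)) < 1.0:
            lo = mid
        else:
            hi = mid
    P = [min(hi * q, ui) for q, ui in zip(Q, u)]
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def gamma(alpha, p, n, eps):
    """Right-hand side of the condition in (7)."""
    m = p * (1.0 - alpha)
    return (math.log(1.0 / eps) + 2 * n * math.log(m + 1.0)) / m

def alpha_lower_bound(P_hat, Q, p, eps, steps=40):
    """Bisection for alpha_L = max{alpha : D*_alpha > gamma(alpha, p)}."""
    n = len(Q)

    def holds(a):
        return min_kl_value(P_hat, Q, a) > gamma(a, p, n, eps)

    if not holds(0.0):
        return 0.0     # the condition already fails at alpha = 0
    lo, hi = 0.0, 1.0  # holds at lo; fails for alpha close to 1
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if holds(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Setup in the spirit of Fig. 2: n = 11, uniform Q0, point-mass contaminant.
n, pi = 11, 0.2
Q0 = [1.0 / n] * n
P_hat = [(1 - pi) * q for q in Q0]
P_hat[0] += pi
kappa = max(1.0 - a / b for a, b in zip(P_hat, Q0))
alpha_small = alpha_lower_bound(P_hat, Q0, p=10**4, eps=0.05)
alpha_large = alpha_lower_bound(P_hat, Q0, p=10**6, eps=0.05)
```

Each bisection step halves the search interval, so the number of times (5) must be solved grows only logarithmically with the desired accuracy in $\alpha$.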











Fig. 1. Geometric interpretation of Proposition 1 and the optimization in
(5) with $\mathcal{Q} = \{Q^0\}$. The width of the hypercube around $P$ is $\alpha$. As $\alpha$ is
increased, the hypercube eventually intersects the 'outer bound' set, which
represents the set of distributions closest to $Q^0$ in KL divergence; the sets
intersect when $\alpha = \alpha_L$. Note that the 'outer bound' set also increases in size
as $\alpha$ increases.

Fig. 1 shows a geometric interpretation of Proposition 1 and
the optimization in (5). See the caption for details.

The lower bound obtained by solving the series of optimization
problems converges to the separation distance, captured
by the following theorem.

Theorem 2. Let $\mathcal{Q} = \{Q^0\}$. Fix $P(X)$. Then

$$\kappa(P\|Q^0) - \alpha_L = O\left(\sqrt{\frac{\log(p)}{p}}\right).$$

Proof: See Appendix B. ■

Theorem 2 is stated for a fixed $P(X)$, although one would
in general assume $P(X)$ to be an implicit function of $p$. The
reason for fixing $P(X)$ is both generality and simplicity. The
assumption decouples randomness from the convergence rate
of the upper bound and the lower bound produced by the
optimization; without this assumption, the upper and lower bounds
would be random variables, and necessitate a probabilistic
statement. We also note that a precise limit statement can be
readily extracted from the proof.


E. Discussion


In practice, it is often the case that the precise model
distribution is not known; instead, it may be known that the model
distribution comes from some family of distributions. This
arises in anomaly detection when normal events are known to
correspond to unknown proportions of samples from a finite
set of distributions. This is the case of the mixture model,
i.e., $\mathcal{Q}$ is the set of all distributions that can be represented
as $Q = \sum_j \pi_j Q^j$ for any mixture proportions $\pi_j$. As the set
of mixture distributions with unknown mixture proportions
is a convex set, we can directly address this setting using the
developments of Sec. II-D. Jointly optimizing over the mixture
weights and the mixture distribution, the optimization takes the
form

$$\underset{P \in \mathcal{P}(X,\alpha),\ \pi}{\text{minimize}} \quad \sum_{i=1}^{n} P_i \log\left(\frac{P_i}{\sum_j \pi_j Q^j_i}\right) \qquad (8)$$

where the mixture weights $\pi$ are constrained to the probability simplex.


We note that the above optimization can be solved at scale in 
real time for many applications; see discussions of numerical 
experiments below for details. 

For many applications, model distributions are generated
using a finite amount of data from known good sources (i.e.,
sources that are known to have no contamination). Let $\hat{Q}$ be an
empirical distribution generated from $p'$ samples of an i.i.d.
population, and consider the set

$$\mathcal{Q}' = \left\{Q : \hat{Q} \text{ is typical for } (\{Q\}, \epsilon)\right\}.$$

Here, $\mathcal{Q}'$ is the set of all distributions that have $\hat{Q}$ as a typical
empirical distribution. As before, determining membership in
$\mathcal{Q}'$ is intractable for large $p'$ and more than two categories. Let

$$\tilde{\mathcal{Q}} = \left\{Q : D(\hat{Q}\|Q) \le \frac{1}{p'}\log\left(\frac{1}{\epsilon}\right) + \frac{2n}{p'}\log(p'+1)\right\}.$$

$\tilde{\mathcal{Q}}$ satisfies two important properties. First, $\mathcal{Q}' \subset \tilde{\mathcal{Q}}$ by Theorem
1, and second, $\tilde{\mathcal{Q}}$ is a convex set.

Solving the optimization in (6) with $\mathcal{Q} = \tilde{\mathcal{Q}}$ provides a
powerful result which we state in the following proposition.

Proposition 2. Consider two empirical distributions $P$ and $\hat{Q}$.
Let $\mathcal{Q} = \tilde{\mathcal{Q}}$, defined above, and let $D^*_0$ be the solution to the
optimization in (6) with $\alpha = 0$. If

$$D^*_0 > \frac{1}{p}\log\left(\frac{1}{\epsilon}\right) + \frac{2n}{p}\log(p+1),$$

there is no $Q$ that simultaneously satisfies 1) $\hat{Q}$ is typical with
respect to $Q$ and 2) $P$ is typical with respect to $Q$.

Satisfying Proposition 2 implies that observing a $\hat{Q}$ and a
$P$ generated by the same underlying distribution by chance
can occur at most a fraction $\epsilon$ of the time; in this sense, $P$
must be contaminated. With a single parameter search over
$\alpha \in [0,1]$, the lower bound applies: $\alpha^* \ge \alpha_L$. We note that
the formulation does not require the empirical model and the
distribution under test to have joint support.

Numerical experiments were conducted to highlight the
utility of Proposition 1; results are shown in Fig. 2. In contrast
to the deterministic experiments in Fig. 2, experiments with
random samples from various model and test distributions as
input were also run, showing similar convergence behavior. An
experiment with $\mathcal{Q}$ being a set of 10 mixture distributions
with $n = 50$ was also conducted. The line search over $\alpha$ was
completed using a bisecting search to an accuracy of $2^{-28}$
(the optimization was solved 27 times for each experiment).
Averaged over 50 trials, the total time to compute $\alpha_L$ was 0.4
seconds. Experiments were implemented using CVXOPT [3]
and results visualized with matplotlib [4].








Fig. 2. Numerical example, $n = 11$, $\epsilon = 0.05$, $\mathcal{Q} = \{Q^0\}$, with $Q^0$
a uniform distribution over 11 categories. Solid lines show $\alpha_L$ divided by
$\kappa(P\|Q^0)$ for mixture distributions $P = (1-\pi)Q^0 + \pi U_{10}$, where $U_{10}$
is a uniform distribution over 10 of the 11 categories. Dashed lines show $\alpha_L$
divided by $\kappa(P\|Q^0)$ for $P = (1-\pi)Q^0 + \pi\delta$, where $\delta$ is a point mass.


III. Related Work 


Related work can be broadly classified into traditional work
in goodness of fit (GoF) testing, and more recent work in
anomaly detection. GoF testing has an extensive literature.
When the data are binary valued, and the model distribution
Bernoulli, quantifying contamination using GoF tests can be
addressed by evaluating binomial probabilities (a technique
known as Fisher's Exact method [5]). When the data take
on more than two values, exact solutions for the level of
contamination become intractable.

A customary approach to GoF testing for categorical data
is Pearson's $\chi^2$ test [6]. This approach to GoF testing can
be quite powerful, but suffers from limitations. $\chi^2$ tests are
approximations, and are known to be invalid under certain
conditions. In particular, the test is invalid when $p_i = 0$ for one
or more categories. Nonetheless, employing the $\chi^2$ test, one
can deduce another optimization (much as we do in Sec. II-D)
to answer the aforementioned question; we note the resulting
optimization is a separable quadratic program with linear
equality constraints which has an analytic solution [7], and
would be an interesting starting point for future work. Since
Pearson's $\chi^2$ test hinges on a normal approximation, this
approach would not result in strict contamination bounds. More
specific to the contamination estimation problem presented
here, recent work includes decontamination with multiclass
label noise [8], [9], which focuses on recovering proportions
of a set of mixture distributions present in a dataset.

There is an extensive literature on the related topics of
anomaly detection and outlier detection including work
employing entropy based techniques, in particular [10] and [11];
we note the formulations here are distinct in that the level
of contamination is not estimated. Lastly, we briefly discuss
related work in anomaly detection in the areas of computer
networks, systems and security as this is the motivation for our
developments. Early work on identifying anomalous or
unexpected behaviors such as faults (e.g., due to outages or failures)
or spikes (e.g., associated with DoS attacks or flash crowds) in
computer network traffic was based on the application of graph
models, time series and multi-resolution methods, e.g., GD-G3,
and Principal Component Analysis (PCA) G9-G9.
There are significant difficulties in tuning these methods to
provide low false alarm rates in practice US’, necessitating
methods based on statistical significance, as presented here.
Appendix A


Proof of Theorem 1 requires two ingredients, both relying on results from large deviations theory. The first ingredient is
Sanov's Theorem, which we state below.

Theorem 3. (Sanov's Theorem) [20] (Theorem 11.4.1). Let $S$ be a set of empirical distributions (with $p$ samples over $n$
categories). Then

$$\mathbb{P}_Q(S) \le (p+1)^n \exp\left(-p \min_{P \in S} D(P\|Q)\right). \qquad (9)$$

The second ingredient is also readily derived from results in large deviations theory.

Theorem 4. Let $S$ be a set of empirical distributions such that $\mathbb{P}_Q(P^\ell) \ge \mathbb{P}_Q(P)$ for all $P \in S$. Then,

$$\min_{P \in S} D(P\|Q) \ge D(P^\ell\|Q) - \frac{n}{p}\log(p+1).$$

Proof: The following inequalities hold [20] (Theorem 11.1.4):

$$(p+1)^{-n} \exp\left(-pD(P\|Q)\right) \le \mathbb{P}_Q(P) \le \exp\left(-pD(P\|Q)\right). \qquad (10)$$

Thus, for any $\mathbb{P}_Q(P^m) \le \mathbb{P}_Q(P^\ell)$,

$$(p+1)^{-n} \exp\left(-pD(P^m\|Q)\right) \le \exp\left(-pD(P^\ell\|Q)\right)$$

which implies the result, completing the proof of Theorem 4. ■

Combining Theorems 3 and 4 we have

$$\mathbb{P}_Q\left(\{P^1, P^2, \dots, P^\ell\}\right) \le (p+1)^{2n} \exp\left(-pD(P^\ell\|Q)\right)$$

provided $\mathbb{P}_Q(P^1) \le \mathbb{P}_Q(P^2) \le \dots \le \mathbb{P}_Q(P^\ell)$. This provides a simple way to confirm if a sample is contaminated at a
particular significance level $\epsilon$. In particular, assume $P(X) = P^\ell$. If

$$(p+1)^{2n} \exp\left(-pD(P(X)\|Q)\right) < \epsilon$$

or equivalently

$$D(P(X)\|Q) > \frac{1}{p}\log\left(\frac{1}{\epsilon}\right) + \frac{2n}{p}\log(p+1) \qquad (11)$$

then $X$ is not typical; $X$ satisfies

$$\mathbb{P}_Q\left(\{P^1, P^2, \dots, P^\ell\}\right) < \epsilon$$

and is contaminated with significance $\epsilon$. If (11) holds for all $Q \in \mathcal{Q}$, in other words, if

$$\inf_{Q \in \mathcal{Q}} D(P(X)\|Q) > \frac{1}{p}\log\left(\frac{1}{\epsilon}\right) + \frac{2n}{p}\log(p+1)$$

we conclude that $X$ is not typical with respect to $(\mathcal{Q}, \epsilon)$, implying the result.
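The combined bound can be checked numerically when $n = 2$, since the probability of each empirical distribution is then an exact binomial term. The sketch below is such a sanity check under that assumption; the function names are hypothetical. It orders the empirical distributions from least to most likely and compares each cumulative probability with $(p+1)^{2n}\exp(-pD(P^\ell\|Q))$.

```python
import math

def binary_kl(x, q):
    """D([x, 1 - x] || [q, 1 - q]) in nats, with 0 * log(0) = 0."""
    total = 0.0
    for a, b in ((x, q), (1.0 - x, 1.0 - q)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def tail_bound_holds(p, q):
    """Check the combined bound of Theorems 3 and 4 for every l when n = 2."""
    n = 2
    pmf = [math.comb(p, k) * q**k * (1 - q)**(p - k) for k in range(p + 1)]
    order = sorted(range(p + 1), key=lambda k: pmf[k])  # least likely first
    tail = 0.0
    for k in order:
        tail += pmf[k]  # probability of {P^1, ..., P^l}
        bound = (p + 1) ** (2 * n) * math.exp(-p * binary_kl(k / p, q))
        if tail > bound:
            return False
    return True

ok_skewed_model = tail_bound_holds(50, 0.3)
ok_uniform_model = tail_bound_holds(200, 0.5)
```

In this two-category setting the bound holds with a large margin, which reflects the polynomial factor $(p+1)^{2n}$ and explains why Theorem 1 is conservative for small $p$.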

Appendix B 

Proof of Theorem 2: The proof requires three main steps. The first step is to show that when $\alpha$ is sufficiently close to
$\kappa(P\|Q^0)$, the solution to (5) can be written in closed form. The second step is to show a number of properties regarding the
asymptotic behavior of $\alpha_L$ as $p$ grows; specifically, $\alpha_L$ is monotone increasing in $p$, and converges to the separation distance;
these properties imply that for large $p$, the closed form solution is valid. Lastly, we can bound the difference between $\kappa(P\|Q^0)$
and $\alpha_L$ using the closed form solution.

Step 1: For $\alpha$ close to the separation distance (equivalently, for large $p$, as we show next in Theorem 6), the optimization
has a closed form. This is captured in the following theorem. Note the theorem assumes there is a unique index attaining the
maximum in $\kappa(P\|Q)$; in the degenerate case when this is not true, the theorem can be restated introducing at most a factor of $n$,
which does not affect the final result.




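Theorem 5. Let $P = P(X)$ denote the fixed empirical distribution, and let the categories be ordered such that
$\frac{P_\ell}{Q_\ell} < \frac{P_k}{Q_k} \le \dots \le \frac{P_i}{Q_i}$, so that $\ell$ indexes the smallest ratio and $k$ the second smallest. For
$\alpha \in \left[1 - P_\ell - \frac{P_k}{Q_k}(1 - Q_\ell),\ \kappa(P\|Q)\right]$,

$$P^*_i = \begin{cases} \dfrac{Q_i\left(1 - \frac{P_\ell}{1-\alpha}\right)}{1 - Q_\ell} & i \neq \ell \\[2ex] \dfrac{P_\ell}{1-\alpha} & i = \ell \end{cases} \qquad (12)$$

is the unique solution to (5).

Proof: The result can be shown by verifying the KKT conditions with

$$\lambda^*_i = \begin{cases} 0 & i \neq \ell \\[1ex] \log\left(\dfrac{Q_\ell\left(1 - \frac{P_\ell}{1-\alpha}\right)}{(1 - Q_\ell)\frac{P_\ell}{1-\alpha}}\right) & i = \ell \end{cases} \qquad (13)$$

and

$$\nu^* = \log\left(\frac{1 - Q_\ell}{1 - \frac{P_\ell}{1-\alpha}}\right) - 1$$

where the Lagrangian © is given as

$$L(P, \lambda, \nu) = \sum_{i} P_i \log\frac{P_i}{Q_i} + \sum_{i} \lambda_i\left(P_i - \frac{P_i(X)}{1-\alpha}\right) + \nu\left(\sum_{i} P_i - 1\right).$$

These primal and dual optimal points are derived using methods similar to © (p. 228, 248); in what follows, we simply
verify the KKT conditions which suffice to complete the proof. First, we confirm that the solution is a stationary point:

$$\left.\frac{\partial L(P, \lambda, \nu)}{\partial P_i}\right|_{P^*, \lambda^*, \nu^*} = \log\frac{P^*_i}{Q_i} + 1 + \lambda^*_i + \nu^* = 0$$

which holds for all $i$. The complementary slackness condition is readily verified:

$$\lambda^*_i\left(P^*_i - \frac{P_i}{1-\alpha}\right) = 0, \quad i = 1,\dots,n.$$

It remains to show conditions under which the solution is primal and dual feasible. First, $\lambda^*_\ell \ge 0$ provided

$$\frac{Q_\ell\left(1 - \frac{P_\ell}{1-\alpha}\right)}{(1 - Q_\ell)\frac{P_\ell}{1-\alpha}} \ge 1.$$

After arranging terms, the above holds when $\alpha \le 1 - \frac{P_\ell}{Q_\ell} = \kappa(P\|Q)$. The primal equality constraint, $\sum_i P^*_i = 1$, is readily
verified. Lastly, we check the primal inequality constraints. $P^*_\ell$ is trivially feasible. For $i \neq \ell$, we require

$$P^*_i = \frac{Q_i\left(1 - \frac{P_\ell}{1-\alpha}\right)}{1 - Q_\ell} \le \frac{P_i}{1-\alpha}$$

which holds when

$$\alpha \ge 1 - P_\ell - \frac{P_i}{Q_i}(1 - Q_\ell).$$

Since $\frac{P_i}{Q_i} \ge \frac{P_k}{Q_k}$ for all $i \neq \ell$, the solution is feasible if

$$\alpha \ge 1 - P_\ell - \frac{P_k}{Q_k}(1 - Q_\ell).$$

We conclude that the KKT conditions are satisfied for the range of $\alpha$ specified in the statement of the theorem. Since the objective
is strictly convex the solution is unique, which completes the proof.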
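The KKT verification above can be mirrored numerically. The sketch below evaluates a closed-form candidate for the primal point and multipliers on a small hypothetical example and reports the stationarity and complementary slackness residuals together with a feasibility flag. The function names, the example distributions, and the choice $\alpha = 0.3$ are illustrative assumptions; the index $\ell$ is taken to be the category with the smallest ratio $P_i/Q_i$.

```python
import math

def closed_form_solution(P, Q, alpha):
    """Candidate primal/dual solution of (5), assuming a unique index l that
    minimizes P_i / Q_i and an alpha inside the range stated in Theorem 5."""
    l = min(range(len(P)), key=lambda i: P[i] / Q[i])
    P_l = P[l] / (1.0 - alpha)
    P_star = [q * (1.0 - P_l) / (1.0 - Q[l]) for q in Q]
    P_star[l] = P_l
    lam = [0.0] * len(P)
    lam[l] = math.log(Q[l] * (1.0 - P_l) / ((1.0 - Q[l]) * P_l))
    nu = math.log((1.0 - Q[l]) / (1.0 - P_l)) - 1.0
    return P_star, lam, nu

def kkt_residuals(P, Q, alpha, P_star, lam, nu):
    """Return (stationarity residual, slackness residual, feasibility flag)."""
    upper = [p / (1.0 - alpha) for p in P]
    stationarity = max(abs(math.log(ps / q) + 1.0 + li + nu)
                       for ps, q, li in zip(P_star, Q, lam))
    slackness = max(abs(li * (ps - ui))
                    for li, ps, ui in zip(lam, P_star, upper))
    feasible = (abs(sum(P_star) - 1.0) < 1e-12
                and all(ps <= ui + 1e-12 for ps, ui in zip(P_star, upper))
                and all(li >= 0.0 for li in lam))
    return stationarity, slackness, feasible

# Hypothetical example: kappa = 1 - 0.15/0.25 = 0.4, valid range [0.25, 0.4].
P = [0.40, 0.25, 0.20, 0.15]
Q = [0.25, 0.25, 0.25, 0.25]
alpha = 0.30
P_star, lam, nu = closed_form_solution(P, Q, alpha)
stationarity, slackness, feasible = kkt_residuals(P, Q, alpha, P_star, lam, nu)
```

Choosing $\alpha$ below the lower end of the range (for example $\alpha = 0.2$ here) makes the candidate violate an upper bound constraint, which is the feasibility condition derived at the end of the proof.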

■ 

Step 2: We show that as $p$ approaches infinity, $\alpha_L$ approaches the separation distance. More specifically, we have the following
theorem.

Theorem 6. $\alpha_L \le \kappa(P\|Q^0)$ and

$$\lim_{p \to \infty} \alpha_L = \kappa(P\|Q^0).$$

We begin the proof by examining the behavior of $D^*_\alpha$ and $\alpha_L$. Note that $D^*_\alpha$ (the minimum value of (6)) is monotone
non-increasing in $\alpha$, as increasing $\alpha$ relaxes the constraints. For $\alpha = \kappa(P\|Q^0)$, $D^*_\alpha = 0$ as the constraints allow $P_i = Q^0_i$ for all $i$
(as KL divergence is minimized if and only if $P_i = Q_i$ for all $i$). Define

$$\gamma_L(\alpha, p) = \frac{1}{p(1-\alpha)}\log\left(\frac{1}{\epsilon}\right) + \frac{2n}{p(1-\alpha)}\log\left(p(1-\alpha)+1\right)$$

for $\alpha \in [0,1]$, $p > 0$. We can write (7) as

$$\alpha_L = \max\left\{\alpha : D^*_\alpha > \gamma_L(\alpha, p)\right\}.$$

For fixed $p$, $\gamma_L(\alpha, p)$ is strictly increasing in $\alpha$ for $\alpha \in [0,1]$. This (and since $D^*_\alpha$ is monotone non-increasing in $\alpha$) implies
existence and uniqueness of $\alpha_L$ for fixed $p$. Next, for fixed $\alpha$, $\gamma_L(\alpha, p)$ is strictly decreasing in $p$. Since $D^*_\alpha$ is not a function
of $p$, we conclude that $\alpha_L$ is non-decreasing in $p$.

Lastly, to prove the limit statement, we require $D^*_\alpha$ be left continuous at $\alpha = \kappa(P\|Q^0)$; for any $\epsilon > 0$, there exists some
$\delta > 0$ such that $D^*_{\kappa - \delta} < \epsilon$. This follows as the objective is continuous in the optimization variables, and constraints are
continuous in $\alpha$; an arbitrarily small increase in the objective can be realized by a sufficiently small reduction in $\alpha$.

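Step 3: Bound $\kappa(P\|Q) - \alpha_L$ using the closed form solution.

The value of KL divergence at $P^*$ from (12) is

$$D(P^*\|Q) = P^*_\ell \log\frac{P^*_\ell}{Q_\ell} + (1 - P^*_\ell)\log\frac{1 - P^*_\ell}{1 - Q_\ell} \le \frac{(P^*_\ell - Q_\ell)^2}{Q_\ell(1 - Q_\ell)} = \frac{Q_\ell\left(\kappa(P\|Q) - \alpha\right)^2}{(1 - Q_\ell)(1-\alpha)^2} \qquad (14)$$

where the inequality follows since $\log(x) \le x - 1$. We are ready to bound the difference between $\alpha_L$ and the separation
distance. Recall the definition of $\alpha_L$; $\alpha_L$ must satisfy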



and by (14)



which implies the result 




















